1 Getting Started

1.1 Setup

Welcome! While we’re waiting:

  • Clone or download the workshop files from: https://github.com/dlab-geo/rCensus_workshop
    • If you downloaded the zipfile, unzip it.
    • Make a note of the folder in which the files reside.
  • Open RStudio

  • Open a new R script file

1.2 Introduction

  • About me

  • About you
    • Your familiarity with US Census data
    • with geospatial data
    • with geospatial data in R

1.3 Outline

  • Describe primary Census data products

  • Introduce R packages for working with Census Data

  • Use those packages to fetch census data

  • Use those packages to fetch census data plus census geographic boundary files

  • Make maps of census data

2 Census Data Overview

2.1 US Census Data

The “nation’s leading provider of quality data about its people and economy.”

Available at www.census.gov

2.2 Primary Census Products

  • Decennial Census

  • American Community Survey (ACS)

2.3 Decennial Census

Complete count of the population every 10 years since 1790

Includes data on

  • population, by age & race/ethnicity

  • housing, by occupancy & tenure (owned, rented)

2.4 American Community Survey (ACS)

  • Annual survey of a sample of about 3 million households

  • Provides estimates of demographic, social, economic & housing characteristics

  • Includes margin of error values for the estimates.

2.5 Decennial Census* vs ACS Data

Demographic*      Social              Economic            Housing
Sex               Families            Income              Tenure*
Age               Education           Benefits            Occupancy*
Race              Marital Status      Employment Status   Structure Type
Hispanic Origin   Fertility           Occupation          Housing Value
                  Grandparents        Industry            Taxes & Insurance
                  Veterans            Commuting           Utilities
                  Disability Status   Place of Work       Mortgage
                  Language at Home    Health Insurance    Monthly Rent
                  Citizenship
                  Mobility

2.6 Census Geographies

Census data are publicly available at one or more levels of geographic aggregation.

2.7 Census Data & Census Geographies

2.9 Census Data Workflow

Identify your

  • topic of interest
  • year(s)
  • geographic level of detail
  • for what locations?

Then determine what specific tables and variables are available - ACS or Decennial?

2.10 CAUTION

“If you want to measure change you can’t change the measures!”

Census tables, variables, geographies, and geographic boundaries change over time!

Measuring change over time with census data is its own thing, complex and not covered by this workshop!

3 R Packages

3.1 Packages for Working with Census Data

These are the ones we recommend and will use today.

4 tidycensus & tigris

4.1 tidycensus

Functions for accessing decennial census and ACS 5-year datasets via Census APIs

  • only a subset of datasets / years available
  • requires a Census API key

4.2 tidycensus

Limited set of years available via tidycensus

  • decennial census: 1990, 2000, and 2010
  • ACS 5 yr: 2006-2010 through 2012-2016 are available.
  • Note: tidycensus refers to ACS 5-year datasets by the end year.
  • 2013 - 2017 released Dec 6, 2018 by the Census.
    • Need to check its availability in tidycensus.
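
Because tidycensus identifies each ACS 5-year dataset by its end year, it can help to translate an end year back into the period it covers. A minimal base R sketch; the helper name acs5_period is made up for illustration and is not part of tidycensus:

```r
# Hypothetical helper (not a tidycensus function): turn an ACS 5-year
# end year into the label of the five-year period it stands for.
acs5_period <- function(end_year) {
  paste0(end_year - 4, "-", end_year)
}

acs5_period(2016)           # period covered by end year 2016
acs5_period(c(2016, 2017))  # works on several end years at once
```

So a request for year 2016 refers to the 2012-2016 dataset, and 2017 to the 2013-2017 release mentioned above.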

4.3 tigris

Provides access to census geographic data files

  • detailed TIGER/Line boundary files (e.g., shapefiles), or
  • simplified Cartographic boundary files

Also provides access to census feature data,

  • e.g., rivers, roads, coastlines, landmarks, and more

Used by tidycensus to access state, county, tract, block group, block, and ZCTA boundaries.

  • Use tigris directly to access other census geographic data.

4.4 tidycensus & tigris

Packages developed by Kyle Walker to make it easier to fetch data from Census websites and APIs in R and get that data in a useable format to analyze, plot, and map.

Check out his website to keep abreast of his great packages, blog posts and tutorials.

Walker also developed a new DataCamp course Analyzing US Census Data in R!

  • Highly recommended! First chapter free!

4.5 tidyverse

A collection of R Packages for data science - developed primarily by Hadley Wickham, Chief Scientist at RStudio.

  • dplyr and tidyr for reshaping data

  • ggplot2 for plotting

  • purrr, readr and tibble for improved performance

These packages are used by tidycensus under the hood.

4.6 sf

Simple features for geospatial data objects and methods.

  • Next generation R package for working with vector geospatial data
    • will soon supersede the sp package

sf includes the functionality of the sp, rgdal, rgeos and proj4 packages.

  • but with improved performance, simplified command syntax and easier workflows.

4.7 Alternatives to Accessing Census Data in R

You can write code to access the Census APIs directly.
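
As a rough sketch of what a direct request involves, the code below only assembles a query URL for the 2010 decennial SF1 data and does not send it. The endpoint path and the get/for/key parameter names reflect the Census API as it is commonly documented, but treat them as assumptions and check the API documentation. build_census_url is a made-up helper and YOUR_KEY is a placeholder.

```r
# Hypothetical helper: assemble (but do not send) a Census API query URL.
# The URL layout is an assumption based on the public API documentation.
build_census_url <- function(year, dataset, variables, geography, key) {
  base  <- paste0("https://api.census.gov/data/", year, "/", dataset)
  query <- paste0("get=", paste(variables, collapse = ","),
                  "&for=", geography,
                  "&key=", key)
  paste0(base, "?", query)
}

# Total population (P001001) for all states; YOUR_KEY is a placeholder
url <- build_census_url(2010, "dec/sf1", c("NAME", "P001001"),
                        "state:*", "YOUR_KEY")
url
```

Packages like tidycensus handle this URL building, the request, and the parsing of the response for you.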

You can download Census data directly from:

You can download Census geographic data directly on the census website

5 Tutorial Time!

5.1 Part 1

We will work through several exercises using tidycensus to fetch, wrangle and map census data.

5.2 Loading packages

Load the packages we will use today

library(tidycensus)
library(tidyverse) 
library(tigris)
library(sf)

5.3 Install any packages that you do not have on your computer

Also install any dependencies.

# install.packages("tidyverse")
# install.packages("tidycensus")
# install.packages("tigris")
# install.packages("sf")
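
One way to install only what is missing is to compare the workshop packages against what is already in your R library. A minimal sketch in base R; the helper name missing_pkgs is made up, and the install step is guarded so it only runs in an interactive session:

```r
# Packages used in this workshop
pkgs <- c("tidyverse", "tidycensus", "tigris", "sf")

# Hypothetical helper: which of the requested packages are not installed?
missing_pkgs <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_pkgs(pkgs)
to_install

# Install only the missing ones; dependencies = TRUE also pulls in
# the packages they depend on
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install, dependencies = TRUE)
}
```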

5.4 Census API Key

You need a census API key to programmatically fetch census data.

Get it here (pretty quick):

For more info see:

5.5 Install your Census API Key

Use the tidycensus function census_api_key to make tidycensus use your key when it fetches data from the census.

# Install your census api key - a long alphanumeric string, passed in quotes
census_api_key("THE_BIG_LONG_ALPHANUMERIC_API_KEY_YOU_GOT_FROM_CENSUS")
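
To check whether a key is already available in your session, you can look at the environment variable where tidycensus is understood to keep it. The variable name CENSUS_API_KEY is an assumption here, and has_census_key is a made-up helper:

```r
# Hypothetical helper: TRUE if a non-empty key is stored in the
# CENSUS_API_KEY environment variable (assumed to be the variable
# that tidycensus reads)
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

if (!has_census_key()) {
  message("No Census API key found; call census_api_key() first.")
}
```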

5.6 Set working directory

Be sure to clone or download the workshop files from: https://github.com/dlab-geo/rCensus_workshop

  • unzip if needed

Then, set your working directory to this folder, e.g.,

  • setwd("~/Documents/Dlab/workshops/2019/rCensus_workshop")

6 Fetching Decennial Census Data

6.1 Population Data

Let’s start by fetching population data from the 2010 Census for all states

In order to fetch census data you need to identify the census variables that contain the data of interest.

6.2 Topics, Tables & Variables

Census data variables are organized in tables

Which are organized by topic or concept.

The tidycensus load_variables function can help with this step.

First, take a look at the function documentation.

?load_variables

6.3 load_variables

Use load_variables to fetch all variables used in the 2010 census into a dataframe.

vars2010 <- load_variables(year=2010,        # Year or end year for ACS
                           dataset = 'sf1',  # 'sf1' for decennial or 'acs5'
                           cache = TRUE)     # Whether to save fetched data locally

6.4 Decennial Census Variables

Let’s take a look at and discuss the resultant dataframe.

  • How many 2010 census variables are in the dataframe?
View(vars2010)

6.5 2010 Decennial Census Tables

  • Variables: 3,346

  • Topics: Population, housing

  • Tables: currently 333 - that’s a lot!
    • 177 population tables (identified with a “P”) available to the block level
    • 58 housing tables (identified with an “H”) available to the block level
    • 82 population tables (identified with a “PCT”) available to the census tract level
    • 4 housing tables (identified with an “HCT”) available to the census tract level
    • 10 population tables (identified with a “PCO”) available to the county level
    • plus 2 additional PCT tables

6.6 What Variable has the 2010 Total Population value?

We can sort and filter the vars2010 dataframe to find it.
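
A sketch of that search, run on a small mock table so it works without an API key. It assumes the dataframe has name, label and concept columns, as load_variables is understood to return; the rows below are illustrative stand-ins, not real output.

```r
# Mock stand-in for vars2010 (illustrative rows only)
vars_mock <- data.frame(
  name    = c("P002001", "P001001", "P002002", "H001001"),
  label   = c("Total", "Total", "Total!!Urban", "Total"),
  concept = c("URBAN AND RURAL", "TOTAL POPULATION",
              "URBAN AND RURAL", "HOUSING UNITS"),
  stringsAsFactors = FALSE
)

# Keep rows whose concept mentions total population, ignoring case
hits <- vars_mock[grepl("total population", vars_mock$concept,
                        ignore.case = TRUE), ]

# Sort the matches by variable name
hits <- hits[order(hits$name), ]
hits
```

With the real vars2010 you would swap vars_mock for vars2010 and adjust the search term as needed.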

6.7 get_decennial

We can use the tidycensus function get_decennial to fetch the 2010 census data for total population by state.

First, check the documentation for the function.

?get_decennial

6.8 get_decennial

Fetch total population by state (P001001) from the 2010 census using get_decennial.

pop2010 <- get_decennial(geography = "state",   # census tabulation unit
                         variables = "P001001", # variable(s) of interest
                         year = 2010)           # census year
## Getting data from the 2010 decennial Census

6.9 View the Data

  • How many rows and columns?

  • Do you see the expected number of states?

  • What column contains the population counts?

  • Do the data values seem to be right?

#pop2010

6.10 Visualize results

We can visualize the data to get a quick overview of the distribution of data values.

It’s a first step in exploratory data analysis and a last step in data communication.

ggplot2 is the most commonly used R package for data visualization.

  • It is loaded when you load the tidyverse package.

Let’s use it to visualize the population data.

6.11 Plot 2010 Population by state

Use ggplot2 to create an ordered horizontal bar chart.

pop_plot<- ggplot(data=pop2010, aes(x=reorder(NAME,value), y=value/1000000)) + 
  geom_bar(stat="identity") + coord_flip() +
  theme_minimal() + 
  labs(title = "2010 US Population by State") +
  xlab("State") +
  ylab("in millions")

6.12 Display the plot

6.13 Challenge

Fetch population data by state for 2000.

Don’t assume variable names are the same across years. Check first!

6.14 Challenge Solution

Total Population in 2000

# What is the variable name in 2000?
vars2000 <- load_variables(year=2000, dataset = 'sf1', cache = T)

# Take a look and search in the dataframe
View(vars2000)

# Fetch the 2000 pop data
pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000)

# Take a look (plot if time)
pop2000
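
To make the "check first" step concrete, here is a sketch that compares a variable code across two variable tables. The tables are small mocks with made-up rows (P999001 is not a real code), and same_meaning is a hypothetical helper. Real concept text may be formatted differently between years, so inspect the matches by eye as well.

```r
# Mock variable tables (made-up rows standing in for vars2000 / vars2010)
vars2000_mock <- data.frame(
  name    = c("P001001", "P002001"),
  concept = c("TOTAL POPULATION", "URBAN AND RURAL"),
  stringsAsFactors = FALSE
)
vars2010_mock <- data.frame(
  name    = c("P001001", "P002001", "P999001"),  # P999001 is made up
  concept = c("TOTAL POPULATION", "URBAN AND RURAL", "HYPOTHETICAL TABLE"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: is the code present in both tables
# with the same concept text?
same_meaning <- function(code, a, b) {
  concept_a <- a$concept[a$name == code]
  concept_b <- b$concept[b$name == code]
  length(concept_a) == 1 && length(concept_b) == 1 && concept_a == concept_b
}

same_meaning("P001001", vars2000_mock, vars2010_mock)  # in both tables
same_meaning("P999001", vars2000_mock, vars2010_mock)  # only in one table
```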

6.15 Limiting by Area of Interest

In the previous example we retrieved population data for all states.

  • This is the default behavior if you don’t specify a subset.

  • But you can limit the data to be retrieved by subunits like state.

6.16 Limit Areas of Interest

Let’s fetch data for just 3 states.

state_pop2010 <- get_decennial(geography = "state", # census tabulation unit
                         variables = "P001001",     # variables of interest
                         year = 2010,               # census year
                         state=c("CA","OR","WA"))   # Filter by states of interest
## Getting data from the 2010 decennial Census

Note we are referencing states by their abbreviation.
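
If you are unsure of an abbreviation, base R ships with the state.abb and state.name vectors for the 50 states (they do not include DC or Puerto Rico). A small sketch of looking up and validating abbreviations before passing them to a fetch:

```r
states_of_interest <- c("CA", "OR", "WA")

# Translate abbreviations to full state names
full_names <- state.name[match(states_of_interest, state.abb)]
full_names

# Any abbreviations that are not among the 50 states?
unknown <- setdiff(states_of_interest, state.abb)
unknown
```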

6.17 View Results

state_pop2010
## # A tibble: 3 x 4
##   GEOID NAME       variable     value
##   <chr> <chr>      <chr>        <dbl>
## 1 06    California P001001  37253956.
## 2 41    Oregon     P001001   3831074.
## 3 53    Washington P001001   6724540.

6.18 Changing Census Tabulation unit

get_decennial accepts a number of different values for tabulation unit.

  • Options include: state, county, tract, block group, block, and ZCTA.

Let’s change the tabulation unit from state to county.

co_pop2010 <- get_decennial(geography = "county",   # census tabulation unit
                            variables = "P001001",  # variables of interest
                            year = 2010)
## Getting data from the 2010 decennial Census

6.19 Changing Census Tabulation unit

View the county data to see what was retrieved.

co_pop2010
## # A tibble: 3,221 x 4
##    GEOID NAME                        variable   value
##    <chr> <chr>                       <chr>      <dbl>
##  1 05131 Sebastian County, Arkansas  P001001  125744.
##  2 05133 Sevier County, Arkansas     P001001   17058.
##  3 05135 Sharp County, Arkansas      P001001   17264.
##  4 05137 Stone County, Arkansas      P001001   12394.
##  5 05139 Union County, Arkansas      P001001   41639.
##  6 05141 Van Buren County, Arkansas  P001001   17295.
##  7 05143 Washington County, Arkansas P001001  203065.
##  8 05145 White County, Arkansas      P001001   77076.
##  9 05149 Yell County, Arkansas       P001001   22185.
## 10 06011 Colusa County, California   P001001   21419.
## # ... with 3,211 more rows

6.20 Challenge

  • Fetch population by county for just California

  • Fetch population by county for Oregon & California

Try it before you look ahead at solutions.

6.21 Challenge Solution

## Fetch population by **county** for just California
co_pop2010_ca <- get_decennial(geography = "county",   # census tabulation unit
                            variables = "P001001",  # variables of interest
                            year = 2010,
                            state=c('CA'))
## Getting data from the 2010 decennial Census
#co_pop2010_ca

## Fetch population by **county** for Oregon & California
co_pop2010_caor <- get_decennial(geography = "county",   # census tabulation unit
                               variables = "P001001",  # variables of interest
                               year = 2010,
                               state=c('CA','OR'))
## Getting data from the 2010 decennial Census
co_pop2010_caor
## # A tibble: 94 x 4
##    GEOID NAME                            variable    value
##    <chr> <chr>                           <chr>       <dbl>
##  1 06011 Colusa County, California       P001001    21419.
##  2 06007 Butte County, California        P001001   220000.
##  3 06001 Alameda County, California      P001001  1510271.
##  4 06003 Alpine County, California       P001001     1175.
##  5 06005 Amador County, California       P001001    38091.
##  6 06009 Calaveras County, California    P001001    45578.
##  7 06013 Contra Costa County, California P001001  1049025.
##  8 06015 Del Norte County, California    P001001    28610.
##  9 06031 Kings County, California        P001001   152982.
## 10 06021 Glenn County, California        P001001    28122.
## # ... with 84 more rows

6.22 Challenge

  • Fetch population by tract for all states.

  • Fetch population by tract for California.

6.23 Challenge Solution

## Fetch population by **tract** for California.
cal_pop2010_tracts <- get_decennial(geography = "tract",   # census tabulation unit
                                 variables = "P001001",  # variables of interest
                                 year = 2010,
                                 state=c('CA'))
cal_pop2010_tracts


## Fetch population by **tract** for all states.
pop2010_tracts <- get_decennial(geography = "tract",   # census tabulation unit
                                    variables = "P001001",  # variables of interest
                                    year = 2010)

pop2010_tracts  ## DOES THIS WORK?

6.24 Fetching Census Tract Data

If you want census data at the tract level or below you must specify the state & county or counties.

tract_pop2010 <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year
                         state="CA",                  # limit to state of California
                         county=c("Alameda","Contra Costa"))  # and only these counties
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census

6.25 Fetching Census Tract Data

View the results! How many census tracts are in these 2 counties?

tract_pop2010
## # A tibble: 569 x 4
##    GEOID       NAME                                         variable value
##    <chr>       <chr>                                        <chr>    <dbl>
##  1 06001400100 Census Tract 4001, Alameda County, Californ… P001001  2937.
##  2 06001400200 Census Tract 4002, Alameda County, Californ… P001001  1974.
##  3 06001400300 Census Tract 4003, Alameda County, Californ… P001001  4865.
##  4 06001400400 Census Tract 4004, Alameda County, Californ… P001001  3703.
##  5 06001400500 Census Tract 4005, Alameda County, Californ… P001001  3517.
##  6 06001400600 Census Tract 4006, Alameda County, Californ… P001001  1571.
##  7 06001400700 Census Tract 4007, Alameda County, Californ… P001001  4206.
##  8 06001400800 Census Tract 4008, Alameda County, Californ… P001001  3594.
##  9 06001400900 Census Tract 4009, Alameda County, Californ… P001001  2302.
## 10 06001401000 Census Tract 4010, Alameda County, Californ… P001001  5678.
## # ... with 559 more rows
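
The GEOID values above are nested codes: in the tract GEOIDs shown, the first two characters are the state, the next three the county, and the last six the tract. A base R sketch that splits a few of the GEOIDs from the output and counts tracts per county:

```r
# A few tract GEOIDs copied from the output above
geoids <- c("06001400100", "06001400200", "06001400300")

# Split each GEOID into its state, county and tract parts
parts <- data.frame(
  state  = substr(geoids, 1, 2),
  county = substr(geoids, 3, 5),
  tract  = substr(geoids, 6, 11),
  stringsAsFactors = FALSE
)
parts

# Number of tracts per county code
table(parts$county)
```

Applied to the full tract_pop2010$GEOID column, the same table() call is one way to answer the tract count question per county.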

6.26 Challenge

  1. Fetch population by county for Alameda County, California

  2. Fetch population by tract for the nine county Bay Area:
  • Alameda, San Francisco, Contra Costa, Marin, Napa,
  • San Mateo, Santa Clara, Solano, Sonoma

Note: You can use names, abbreviations or FIPS codes for your state and county.

# County FIPS codes for
# Alameda, San Francisco, Contra Costa, Marin, Napa,
# San Mateo, Santa Clara, Solano, Sonoma
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")
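
Bare FIPS codes are easy to mix up, so one option is to name each element and run a couple of sanity checks. This is only a sketch: the names attached to the codes are for readability, and the full county GEOIDs are built by prefixing California's state code 06.

```r
# The same nine codes, each labelled with its county name
nine_counties <- c(
  "Alameda" = "001", "San Francisco" = "075", "Contra Costa" = "013",
  "Marin" = "041", "Napa" = "055", "San Mateo" = "081",
  "Santa Clara" = "085", "Solano" = "095", "Sonoma" = "097"
)

length(nine_counties)         # how many counties are listed?
anyDuplicated(nine_counties)  # 0 means no code appears twice

# Full county GEOIDs: state code 06 plus the county code
county_geoids <- setNames(paste0("06", nine_counties), names(nine_counties))
county_geoids[["Alameda"]]

# Drop the names again if a plain character vector is needed
county_codes <- unname(nine_counties)
```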

6.27 Challenge Solution

#  population by **county** for Alameda County, California
alco_pop2010 <- get_decennial(geography = "county",   # census tabulation unit
                                 variables = "P001001",  # variables of interest
                                 year = 2010,
                                 state=c('CA'),
                                 county=c('Alameda County'))
## Getting data from the 2010 decennial Census
#alco_pop2010

6.28 Challenge Solution

Fetch population by tract for the nine county Bay Area

# County FIPS codes for
# Alameda, San Francisco, Contra Costa, Marin, Napa,
# San Mateo, Santa Clara, Solano, Sonoma
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")

bayarea_pop2010_tract <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year
                         state="CA",                  # limit to state of California
                         county=nine_counties)  # and only these counties
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
#bayarea_pop2010_tract

6.29 RECAP & QUESTIONS

Fetch population by tract for the nine county Bay Area

# County FIPS codes for
# Alameda, San Francisco, Contra Costa, Marin, Napa,
# San Mateo, Santa Clara, Solano, Sonoma
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")

bayarea_pop2010 <- get_decennial(geography = "tract",   # census tabulation unit
                      variables = "P001001",            # variable of interest
                      year = 2010,                      # census year
                      state="CA",                       # limit to state of California
                      county=nine_counties)             # and only these counties

# View the data
bayarea_pop2010

6.30 Fetching data for more than one census variable

What three things are new here?

#urban rural pop for 3 counties
ur_pop10 <- get_decennial(geography = "county",  # census tabulation unit
                           variables = c(urban="P002002",rural="P002005"),
                           year = 2010, 
                           summary_var = "P002001",  # The denominator
                           state='CA',
                           county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census

6.31 Fetching data for more than one census variable

What three things are new here?

  1. You can specify more than one variable:

    variables = c("P002002","P002005")
  2. You can name the output columns.

    variables = c(urban="P002002",rural="P002005")
  3. You can identify a summary_var.

    summary_var = "P002001"

This value is the denominator - the total count of all people or households surveyed. The values in this column can be used as a denominator for other calculations like percent of total.

6.32 Take a look at the results

ur_pop10
## # A tibble: 6 x 5
##   GEOID NAME                         variable   value summary_value
##   <chr> <chr>                        <chr>      <dbl>         <dbl>
## 1 06045 Mendocino County, California urban     48110.        87841.
## 2 06055 Napa County, California      urban    118194.       136484.
## 3 06097 Sonoma County, California    urban    424102.       483878.
## 4 06045 Mendocino County, California rural     39731.        87841.
## 5 06055 Napa County, California      rural     18290.       136484.
## 6 06097 Sonoma County, California    rural     59776.       483878.

6.33 Calculating Percents

The summary_value column comes in handy when you want to compute percent of total.

Here’s one way to do it.

# Calculate the percent of population that is Urban or Rural
ur_pop10 <- ur_pop10 %>%
            mutate(pct = 100 * (value / summary_value))

6.34 Calculating Percents

Let’s take a look at the output

ur_pop10 # Take a look
## # A tibble: 6 x 6
##   GEOID NAME                         variable   value summary_value   pct
##   <chr> <chr>                        <chr>      <dbl>         <dbl> <dbl>
## 1 06045 Mendocino County, California urban     48110.        87841.  54.8
## 2 06055 Napa County, California      urban    118194.       136484.  86.6
## 3 06097 Sonoma County, California    urban    424102.       483878.  87.6
## 4 06045 Mendocino County, California rural     39731.        87841.  45.2
## 5 06055 Napa County, California      rural     18290.       136484.  13.4
## 6 06097 Sonoma County, California    rural     59776.       483878.  12.4
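
A quick sanity check on these percentages: for each county, urban plus rural should add up to 100. The sketch below rebuilds the table by hand from the values printed above, so it runs without fetching anything, and uses base R rather than dplyr.

```r
# Values copied from the ur_pop10 output above
ur <- data.frame(
  NAME = rep(c("Mendocino County, California",
               "Napa County, California",
               "Sonoma County, California"), times = 2),
  variable      = rep(c("urban", "rural"), each = 3),
  value         = c(48110, 118194, 424102, 39731, 18290, 59776),
  summary_value = rep(c(87841, 136484, 483878), times = 2),
  stringsAsFactors = FALSE
)

# Percent of each county's total population
ur$pct <- 100 * ur$value / ur$summary_value

# Sum the urban and rural percentages within each county
pct_total <- tapply(ur$pct, ur$NAME, sum)
pct_total
```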

6.35 Plot it

Plots give us compact visual summaries of the data

myplot <- ggplot(data = ur_pop10, 
          mapping = aes(x = NAME, fill = variable, 
                     y = ifelse(test = variable == "urban", 
                                yes = -pct, no = pct))) +
          geom_bar(stat = "identity") +
          scale_y_continuous(labels = abs, limits=c(-100,100)) +
          labs(title="Urban & Rural Population in Wine Country", 
               x="County", y = " Percent of Population", fill="") +
          coord_flip()

Don’t worry if you don’t get all the ggplot code now. It’s here for reference.

6.36 Plot it

myplot

6.37 Fetch all the data in one table

This is often helpful but you need to keep track of the meaning of each variable.

alco_pop10 <- get_decennial(geography = "tract", # Census tabulation unit
                           table =  "P002",      # Table of urban & rural population counts
                           year = 2010,          # Decennial census year
                           state='CA',           # Filter state
                           county="Alameda")     # Filter county
## Getting data from the 2010 decennial Census

6.38 Take a look

unique(alco_pop10$variable) # What and how many unique vars in table?
## [1] "P002001" "P002002" "P002003" "P002004" "P002005" "P002006"
head(alco_pop10,3)  # Take a look at output
## # A tibble: 3 x 4
##   GEOID       NAME                                          variable value
##   <chr>       <chr>                                         <chr>    <dbl>
## 1 06001400100 Census Tract 4001, Alameda County, California P002001  2937.
## 2 06001400200 Census Tract 4002, Alameda County, California P002001  1974.
## 3 06001400300 Census Tract 4003, Alameda County, California P002001  4865.
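
One way to keep track of what each code means is a small lookup vector. The sketch below labels only the three codes whose meaning appears earlier in this workshop (total, urban, rural). The other codes are left as NA here and should be looked up with load_variables.

```r
# Labels for the codes identified earlier in this workshop
p002_labels <- c(P002001 = "total", P002002 = "urban", P002005 = "rural")

# The variable codes returned for table P002
vars <- c("P002001", "P002002", "P002003", "P002004", "P002005", "P002006")

# Look up each code; codes without a known label come back as NA
labels <- unname(p002_labels[vars])
data.frame(variable = vars, label = labels, stringsAsFactors = FALSE)
```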

6.39 Output options

Let’s try all three of these commands and then look at the output to see what’s different.

get_decennial(geography = "state", variables = "P001001", year = 2010)

get_decennial(geography = "state", variables = c(pop10="P001001"), year = 2010)

get_decennial(geography = "state", variables = c(pop00="P001001"), year = 2010, 
              output="wide")

6.40 Output options

head(get_decennial(geography = "state", variables = "P001001", year = 2010),2)
## Getting data from the 2010 decennial Census
## # A tibble: 2 x 4
##   GEOID NAME    variable    value
##   <chr> <chr>   <chr>       <dbl>
## 1 01    Alabama P001001  4779736.
## 2 02    Alaska  P001001   710231.
head(get_decennial(geography = "state", variables = c(pop10="P001001"), year = 2010),2)
## Getting data from the 2010 decennial Census
## # A tibble: 2 x 4
##   GEOID NAME    variable    value
##   <chr> <chr>   <chr>       <dbl>
## 1 01    Alabama pop10    4779736.
## 2 02    Alaska  pop10     710231.
head(get_decennial(geography = "state", variables = c(pop00="P001001"), year = 2010, output="wide"), 2)
## Getting data from the 2010 decennial Census
## # A tibble: 2 x 3
##   GEOID NAME       pop00
##   <chr> <chr>      <dbl>
## 1 01    Alabama 4779736.
## 2 02    Alaska   710231.
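
The difference between long and wide output is easier to see with more than one variable. As an illustration only, the sketch below takes the urban/rural values from earlier in long form and reshapes them to wide with base R's reshape; it is not how tidycensus does it internally.

```r
# Long format: one row per county and variable (values from ur_pop10 above)
long <- data.frame(
  GEOID = rep(c("06045", "06055", "06097"), times = 2),
  NAME  = rep(c("Mendocino County, California",
                "Napa County, California",
                "Sonoma County, California"), times = 2),
  variable = rep(c("urban", "rural"), each = 3),
  value    = c(48110, 118194, 424102, 39731, 18290, 59776),
  stringsAsFactors = FALSE
)

# Wide format: one row per county, one column per variable
wide <- reshape(long, idvar = c("GEOID", "NAME"),
                timevar = "variable", direction = "wide")

# reshape names the new columns value.urban / value.rural; shorten them
names(wide) <- sub("^value\\.", "", names(wide))
wide
```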

6.41 Data Wrangling

Your R skills can help you reformat the data and make it more useable.

Let’s fetch population data for 2010 & 2000 by state with output="wide".

  • We will label the variables pop00 and pop10.

Then we will combine these into one data frame.

6.42 Data Wrangling

Fetch pop by state from both the 2000 and 2010 census

pop2000 <- get_decennial(geography = "state", variables = c(pop00="P001001"), 
                         year = 2000, output="wide")
## Getting data from the 2000 decennial Census
pop2010 <- get_decennial(geography = "state", variables = c(pop10="P001001"), 
                         year = 2010, output="wide")
## Getting data from the 2010 decennial Census

6.43 Merge population by state from both censuses

Save in a new dataframe with both columns

pop2000_2010 <- pop2000 %>% merge(pop2010, by="NAME") %>%
                             select(NAME, pop00, pop10)

head(pop2000_2010,3)
##      NAME   pop00   pop10
## 1 Alabama 4447100 4779736
## 2  Alaska  626932  710231
## 3 Arizona 5130632 6392017
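
Merging on NAME works here, but names can be spelled differently between sources, so joining on the GEOID code is often the safer key. A sketch with hand-built tables using the values shown above; the rows of the second table are deliberately in a different order.

```r
# Hand-built stand-ins for the two wide tables (values from above)
pop00_tbl <- data.frame(
  GEOID = c("01", "02", "04"),
  NAME  = c("Alabama", "Alaska", "Arizona"),
  pop00 = c(4447100, 626932, 5130632),
  stringsAsFactors = FALSE
)
pop10_tbl <- data.frame(
  GEOID = c("04", "01", "02"),
  NAME  = c("Arizona", "Alabama", "Alaska"),
  pop10 = c(6392017, 4779736, 710231),
  stringsAsFactors = FALSE
)

# Join on GEOID; keep NAME from the first table only
both <- merge(pop00_tbl, pop10_tbl[, c("GEOID", "pop10")], by = "GEOID")
both
```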

6.44 Save the data

Use write.csv to save a data frame to a CSV file.

write.csv(pop2000_2010, file="pop2000_2010.csv", row.names = FALSE)
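
If you also save GEOID codes, take care when reading the file back: codes such as 01 are typically parsed as numbers, which drops the leading zero. A sketch of a round trip through a temporary file that reads GEOID as text; the small table is hand-built from values shown earlier.

```r
# Small table with GEOID codes (values from earlier output)
states <- data.frame(
  GEOID = c("01", "02", "04"),
  NAME  = c("Alabama", "Alaska", "Arizona"),
  pop10 = c(4779736, 710231, 6392017),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file
csv_file <- tempfile(fileext = ".csv")
write.csv(states, file = csv_file, row.names = FALSE)

# Read it back, forcing GEOID to stay a character column
safe <- read.csv(csv_file, colClasses = c(GEOID = "character"))
safe$GEOID
```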

7 QUESTIONS?

8 Part 2. Mapping

8.1 Mapping Census Data with tidycensus

You can fetch geographic data by adding the parameter geometry=TRUE to tidycensus functions

  • Under the hood, tidycensus calls the tigris package to fetch data from the Census Geographic Data APIs.

  • Only a subset of data available via tigris can be accessed via tidycensus.

You can then use common mapping functions like plot, ggplot and tmap to make maps.

8.2 Geometry Options

Before fetching geometry, we need to specify a few tigris options

  • Set the class of returned data to be sf objects (not sp, the default)

  • Set tigris_use_cache to TRUE

# Tigris options - used by tidycensus
options(tigris_class = "sf")      # sp is the default class returned by tigris
options(tigris_use_cache = TRUE)  # Save retrieved data locally

Caching the data is important because it speeds things up when you repeatedly fetch census data for the same geographies.
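
To confirm the options took effect in your session, you can read them back with getOption. A minimal check in base R that works whether or not tigris is loaded:

```r
# Set both options in one call
options(tigris_class = "sf", tigris_use_cache = TRUE)

# Read them back
getOption("tigris_class")
getOption("tigris_use_cache")
```

These settings last only for the current R session, so they need to be set again in each new script or session.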

8.3 tigris cache directory

You may want to use the geographic data downloaded by tigris in other applications.

To do this, you need to know where the files are saved locally.

You can also specify where tigris should save cached data.

# Check the location of the tigris cached data
Sys.getenv('TIGRIS_CACHE_DIR') 

# Set it
tigris_cache_dir("~/Documents/gis_data/census")  # Folder for local data

# Check it again
Sys.getenv('TIGRIS_CACHE_DIR') 

8.4 Fetch geographic boundary data with tidycensus

We fetch the geospatial data by setting geometry=TRUE.

pop2010geo <- get_decennial(geography = "state", 
                          variables = c(pop10="P001001"), 
                          year = 2010, 
                          output="wide", 
                          geometry=TRUE) # Fetch geometry with the data for mapping
## Getting data from the 2010 decennial Census

8.5 Take a look

Let’s take a minute to discuss the format of an sf spatial object.

pop2010geo
## Simple feature collection with 52 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.1473 ymin: 17.88481 xmax: 179.7785 ymax: 71.35256
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 52 x 4
##    GEOID NAME            pop10                                    geometry
##    <chr> <chr>           <dbl>                          <MULTIPOLYGON [°]>
##  1 01    Alabama      4779736. (((-85.00237 31.00068, -85.02411 31.00068,…
##  2 02    Alaska        710231. (((-164.9762 54.13459, -164.9378 54.13668,…
##  3 04    Arizona      6392017. (((-109.0452 36.99908, -109.0452 36.96949,…
##  4 05    Arkansas     2915918. (((-94.55929 36.4995, -94.51948 36.49921, …
##  5 06    California  37253956. (((-122.4463 37.86105, -122.4386 37.86837,…
##  6 22    Louisiana    4533372. (((-88.86507 29.75271, -88.88976 29.7182, …
##  7 21    Kentucky     4339367. (((-83.67541 36.60081, -83.67561 36.59858,…
##  8 08    Colorado     5029196. (((-102.0422 36.99308, -102.0545 36.99311,…
##  9 09    Connecticut  3574097. (((-71.85957 41.3224, -71.86823 41.33094, …
## 10 10    Delaware      897934. (((-75.55945 39.62981, -75.5591 39.62906, …
## # ... with 42 more rows

8.6 Geospatial Data in R

R sf objects include

  • a dataframe with a geometry column named geometry

    • The geometry can be of type POINT, LINESTRING, POLYGON
    • or, MULTIPOINT, MULTILINESTRING or MULTIPOLYGON
  • a CRS (coordinate reference system), specified by
    • epsg(SRID) code
    • proj4string
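
To see these pieces in isolation, here is a sketch that builds a tiny sf object by hand. The two points and their names are made up for illustration; the CRS is set to EPSG code 4269 to match the census data above.

```r
library(sf)

# Two illustrative points (longitude, latitude) with CRS EPSG:4269
pts <- st_sfc(st_point(c(-122.27, 37.87)),
              st_point(c(-121.89, 37.34)),
              crs = 4269)

# Attach the geometry column to an ordinary data frame
places <- st_sf(data.frame(name = c("A", "B"), stringsAsFactors = FALSE),
                geometry = pts)

class(places)              # an sf object is also a data.frame
st_geometry_type(places)   # geometry type of each feature
st_crs(places)$epsg        # the EPSG code of the CRS
```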

8.7 Census Data Coordinate Reference System (CRS)

All census geographic data use the NAD83 CRS, or coordinate reference system.

NAD83 stands for North American Datum of 1983. The geographic coordinates are longitude and latitude values encoded as decimal degrees.

WGS84, the World Geodetic System of 1984, is the most commonly used geographic CRS. Coordinates for the same location in these two systems differ, on average, by up to about 1 meter in the continental US.

Many geospatial operations require you transform data to a common CRS before conducting spatial analysis or mapping.

As an in-depth discussion of CRSs is outside the scope of this workshop, see Geocomputation with R for more information.
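
As a brief sketch of what transforming to a common CRS looks like in sf, the code below converts a single made-up point from NAD83 (EPSG 4269) to WGS84 (EPSG 4326) with st_transform. The coordinates are illustrative only.

```r
library(sf)

# One illustrative point in NAD83 (EPSG:4269)
pt_nad83 <- st_sfc(st_point(c(-122.27, 37.87)), crs = 4269)

# Transform it to WGS84 (EPSG:4326)
pt_wgs84 <- st_transform(pt_nad83, 4326)

st_crs(pt_nad83)$epsg
st_crs(pt_wgs84)$epsg
```

The same call works on a whole sf dataframe, for example one returned with geometry=TRUE.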

8.8 Mapping sf Spatial Objects

We can use plot to make a quick map of the geometry stored in an sf spatial object.

plot(pop2010geo$geometry)

8.9 Question

What do you get if you plot the sf object without specifying “$geometry”?

8.10 The Challenge of US maps

The vast geographic extent and non-contiguous nature of the USA make it difficult to map.

8.11 Fetch geographic data with tidycensus, SHIFTED

tidycensus includes a shift_geo parameter to shift AK & HI to below Texas.

pop2010geo_shifted <- get_decennial(geography = "state", 
                                    variables = c(pop10="P001001"), 
                                    output="wide",
                                    year = 2010, 
                                    geometry=TRUE, 
                                    shift_geo=TRUE)
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Please note: Alaska and Hawaii are being shifted and are not to scale.

8.12 Shift Happens!

plot(pop2010geo_shifted$geometry)

8.13 Save it

You can save sf data to a shapefile using st_write

st_write(pop2010geo_shifted,"usa_2010_shifted.shp")
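
Running st_write a second time on the same file name will usually stop with an error because the layer already exists. A sketch of a repeatable round trip, using a tiny made-up sf object and a temporary folder: delete_layer = TRUE replaces an existing layer, and st_read loads the file back.

```r
library(sf)

# Tiny illustrative sf object (made-up points and values)
pts <- st_sf(
  data.frame(pop10 = c(10, 20)),
  geometry = st_sfc(st_point(c(-122, 37)), st_point(c(-121, 38)), crs = 4269)
)

# Write to a shapefile in the session's temporary folder;
# delete_layer = TRUE overwrites the layer if it already exists
shp <- file.path(tempdir(), "example_points.shp")
st_write(pts, shp, delete_layer = TRUE, quiet = TRUE)

# Read it back and check the number of features
back <- st_read(shp, quiet = TRUE)
nrow(back)
```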

8.14 Check your TIGRIS_CACHE_DIR to see it

my_cache_dir <- Sys.getenv('TIGRIS_CACHE_DIR') 

dir(my_cache_dir) # What files stored there?

8.15 Mapping Data Values

plot(pop2010geo_shifted['pop10'])

8.16 ggplot2 Maps

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf()

8.17 ggplot2 Maps

Note the use of geom_sf, which tells ggplot that spatial data objects are being mapped. This is a huge improvement!

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf()

8.18 Challenge

Create a map of CA Population in 2010 by county

8.19 Challenge Solution

2010 pop Data for California Counties

#fetch it
cal_pop10 <- get_decennial(geography = "county", 
                           variables = "P001001",
                           year = 2010, 
                           state='CA',
                           geometry=TRUE)

# map it
#plot(cal_pop10['value'])

8.20 Fetch County data for more than one state

We can fetch both the census data and the geometry for more than one state!

  • this is so much easier than any alternative approach!
west_pop10 <- get_decennial(geography = "county", 
                           variables =  "P001001",
                           year = 2010, 
                           state=c('CA','OR','NV',"AZ"),
                           geometry=T)
## Getting data from the 2010 decennial Census

8.21 Map it

These are just quick plots to make sure we got the right data!

plot(west_pop10['value'])

8.22 Census Tract Data

Fetching the data for all tracts in one state.

  • but you need to specify one or more counties.
# Fetch tract data 
alco_pop10 <- get_decennial(geography = "tract", 
                           variables = "P001001", 
                           year = 2010, 
                           state='CA',
                           county='Alameda',
                           geometry=T)
## Getting data from the 2010 decennial Census

8.23 Challenge

Fetch and map the 2010 population by census tract for Alameda and Contra Costa counties.

8.24 Challenge Solution

Fetch Tract population & geometry data for Alameda & Contra Costa Counties

alcc_pop10 <- get_decennial(geography = "tract", 
                      variables = "P001001", 
                      year = 2010, 
                      state='CA',
                      county=c("Alameda","Contra Costa"),
                      geometry=T) 
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census
## Getting data from the 2010 decennial Census

8.25 Challenge Solution

Map it

plot(alcc_pop10['value'])

8.26 More Complex Challenge (if time)

Fetch and map the percent of San Francisco properties by census tract that were coded as rented in the 2010 Census.

To start, identify the variables for the

  • total number of housing units

  • number of renter occupied units

8.27 Complex Challenge Solution

SF Rented Units, 2010

sf_rented <- get_decennial(geography = "tract",  # census tabulation unit
                           variables =  "H004004",
                           year = 2010, 
                           summary_var = "H004001",  # Total occupied housing units - the denominator
                           state='CA',
                           county='San Francisco',
                           geometry=T)

sf_pct_rented <- sf_rented[sf_rented$value > 0,] %>%
                 mutate(pct = 100 * (value / summary_value))

plot(sf_pct_rented['pct'])
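One design choice worth a second look is which column to filter on before dividing. The solution above drops tracts with zero rented units; an alternative, sketched below on made-up numbers with no geometry, is to drop only tracts whose denominator is zero, so that tracts with housing but no renters are kept and shown as 0%.

```r
library(dplyr)

# Toy stand-in for the get_decennial() output (no geometry, made-up counts).
toy_rented <- tibble(
  GEOID         = c("t1", "t2", "t3"),
  value         = c(50, 0, 0),    # renter-occupied units
  summary_value = c(200, 80, 0)   # all occupied units (the denominator)
)

toy_pct <- toy_rented %>%
  filter(summary_value > 0) %>%               # drop only tracts with no units
  mutate(pct = 100 * value / summary_value)

toy_pct
```

Here tract t2 stays on the map at 0% rented, while t3 is removed because a percentage cannot be computed for it.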

9 Questions?

10 Part 3. ACS 5 year data

10.1 ACS Data with tidycensus

The tidycensus workflow for ACS data is similar to that used for decennial census data.

  • But there are many more variables in the ACS.

Because the ACS contains sample data, each ACS variable of interest includes both an estimate of the value and a margin of error.

10.2 ACS 5 year

You can use the tidycensus get_acs function to retrieve data for the ACS 5 year products, beginning with the 2006-2010 dataset.

The default end year for my version of tidycensus (as of Dec 4, 2018) is 2016 for the 2012-2016 ACS 5 year dataset.

10.3 Fetch List of ACS 5 year Variables

Let’s start by fetching ACS 5-year 2016 data on poverty.

We want to explore the number of folks living below the poverty level by census tract.

First we need to find the variable name(s)!

10.4 Load ACS Table Vars

Load the ACS 2012-2016 5 year data variables into a dataframe.

  • ACS 5 year datasets are referenced by end year in tidycensus!

Then take a look at the variable names, labels and concepts.

How many variables refer to the concept of poverty?

acs2016vars <- load_variables(year=2016, dataset = 'acs5', cache = T)
#View(acs2016vars)

10.5 ACS Tables and variables

Many hundreds (thousands?) more than for decennial census!

See the documentation on the census website

Types of tables:

  • B prefix = base tables
  • C = collapsed tables
  • DP = data profiles
  • S = Subject tables

10.6 Census Reporter

ACS variables can be confusing.

The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.

Let’s check it out to see what tables/variables we should use.

10.7 Filter the ACS Variables

In RStudio, view the dataframe acs2016vars and interactively filter the name column to display only the variables in the table C17002

Take a look at the different variables in this table.

What variable(s) contain the estimate of the number of people living below poverty?
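If you prefer to filter in code rather than interactively, the idea can be sketched with base R. The small data frame below is a hand-typed stand-in for a few rows of acs2016vars; the labels and concepts are written from memory for illustration and should be checked against the real load_variables output.

```r
# Hand-typed stand-in for a few rows of acs2016vars (name, label, concept).
toy_vars <- data.frame(
  name    = c("C17002_001", "C17002_002", "C17002_003", "B01003_001"),
  label   = c("Estimate!!Total",
              "Estimate!!Total!!Under .50",
              "Estimate!!Total!!.50 to .99",
              "Estimate!!Total"),
  concept = c(rep("RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS", 3),
              "TOTAL POPULATION"),
  stringsAsFactors = FALSE
)

# Keep only the variables whose name starts with the table ID.
c17002 <- toy_vars[grepl("^C17002_", toy_vars$name), ]
c17002

# Count variables whose concept mentions poverty (case-insensitive).
n_poverty <- sum(grepl("poverty", toy_vars$concept, ignore.case = TRUE))
n_poverty
```

Replacing toy_vars with acs2016vars should apply the same filters to the full variable list.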

10.8 get_acs

Use the tidycensus get_acs function to fetch the poverty data for census tracts in San Francisco

?get_acs

10.9 get_acs in action

Fetch the data in the table C17002 that contain the counts of people living below 100% of the poverty line.

sf_poor <- get_acs(geography = "tract",  
                   variables = c('C17002_002','C17002_003'), # poverty variables
                   year = 2016,          
                   state="CA",
                   summary_var = "C17002_001", # Est of num people - denom
                   county="San Francisco",
                   geometry=T)               
## Getting data from the 2012-2016 5-year ACS

10.10 View output

Let’s take a look at the output of get_acs and discuss how it differs from get_decennial.

sf_poor

10.11 Create Poverty Map, try 1

What are we mapping?

# What are we mapping?
plot(sf_poor['estimate'])

10.12 Create Poverty Map, try 2

# Remove census tracts that have no people!
sf_poor <- subset(sf_poor, summary_est > 0)

# What are we mapping?
plot(sf_poor['estimate'])

10.13 Calculating percents

Let’s calculate the percent below poverty by tract.

sf_poor <- sf_poor %>%
  mutate(pct = 100 * (estimate / summary_est))

head(sf_poor, 3)
## Simple feature collection with 3 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79947 xmax: -122.3996 ymax: 37.81144
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##         GEOID                                               NAME
## 1 06075010100 Census Tract 101, San Francisco County, California
## 2 06075010100 Census Tract 101, San Francisco County, California
## 3 06075010200 Census Tract 102, San Francisco County, California
##     variable estimate moe summary_est summary_moe      pct
## 1 C17002_002      314 196        3972         291 7.905337
## 2 C17002_003      397 185        3972         291 9.994965
## 3 C17002_002      160 123        4300         442 3.720930
##                         geometry
## 1 MULTIPOLYGON (((-122.4206 3...
## 2 MULTIPOLYGON (((-122.4206 3...
## 3 MULTIPOLYGON (((-122.4267 3...

10.14 Group by and sum

We want to group the data by the geometry and then sum the data values so that we have one value per geometry.

sf_poor_summed <- sf_poor %>%
  select(GEOID, estimate, pct, geometry) %>%
  group_by(GEOID) %>% 
  summarise(count_below_pov = sum(estimate),
            pct_below_pov = sum(pct))
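The reason this works on spatial data is that summarise, when applied to a grouped sf object, also combines the geometries in each group. A small sketch with two made-up square "tracts", each repeated once per variable, shows the effect (all names and values are hypothetical):

```r
library(sf)
library(dplyr)

# Helper: a unit square whose lower-left corner is at (x0, 0).
sq <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}

# Two toy tracts, each appearing twice (one row per variable).
toy <- st_sf(
  GEOID    = c("t1", "t1", "t2", "t2"),
  estimate = c(10, 5, 20, 0),
  geometry = st_sfc(sq(0), sq(0), sq(2), sq(2))
)

toy_summed <- toy %>%
  group_by(GEOID) %>%
  summarise(count = sum(estimate))  # geometries in each group are unioned

toy_summed
```

The result has one row and one polygon per GEOID, which is the shape of data needed for a choropleth map.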

10.15 Group by and sum

head(sf_poor_summed)
## Simple feature collection with 6 features and 3 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79323 xmax: -122.3897 ymax: 37.81144
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 4
##   GEOID       count_below_pov pct_below_pov                       geometry
##   <chr>                 <dbl>         <dbl>                  <POLYGON [°]>
## 1 06075010100            711.         17.9  ((-122.4206 37.81111, -122.40…
## 2 06075010200            235.          5.47 ((-122.4267 37.80964, -122.42…
## 3 06075010300            625.         14.4  ((-122.4187 37.80593, -122.41…
## 4 06075010400            398.          7.93 ((-122.4149 37.80354, -122.41…
## 5 06075010500            229.          8.53 ((-122.4052 37.80476, -122.40…
## 6 06075010600            882.         24.7  ((-122.411 37.80117, -122.407…

10.16 Map Counts

Where are SF’s poorest areas?

plot(sf_poor_summed['count_below_pov'])

10.17 Map Percents

Where are SF’s poorest areas?

plot(sf_poor_summed['pct_below_pov'])

10.18 Challenge

The ACS 2013-2017 5 year dataset was released Dec 6, 2018.

Although my current version of tidycensus states that 2012-2016 is the latest ACS 5-year product, see if you can fetch & map the percent of people below poverty line in San Francisco using the 2013-2017 ACS 5-year data.

10.19 Challenge Solution

sf_poor_2017 <- get_acs(geography = "tract",  
                   variables = c('C17002_002','C17002_003'), # poverty variables
                   year = 2017,          
                   state="CA",
                   summary_var = "C17002_001", # Est of num people - denom
                   county="San Francisco",
                   geometry=T)   

head(sf_poor_2017)

10.20 Margins of Error (MOE)

We haven’t talked about margins of error, but they may be important in your work with ACS data.

Math is needed to combine MOEs when you combine variables.

  • tidycensus includes some nice functions for these calculations.

See this web page on how to handle MOEs in tidycensus
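To give a feel for the math, here is a base R sketch that combines the two poverty variables for Census Tract 101 using the values printed in the head(sf_poor) output earlier. The formulas are the standard Census Bureau approximations as I understand them; tidycensus provides moe_sum and moe_prop for the same purpose, and those should be preferred in real work.

```r
# Values for Census Tract 101, copied from the head(sf_poor) output.
est     <- c(314, 397)   # C17002_002 and C17002_003 estimates
moe     <- c(196, 185)   # their margins of error
denom   <- 3972          # summary_est
moe_den <- 291           # summary_moe

# MOE of a sum: square root of the sum of squared MOEs.
count     <- sum(est)
moe_count <- sqrt(sum(moe^2))

# MOE of a proportion p = count / denom.
p        <- count / denom
radicand <- moe_count^2 - p^2 * moe_den^2
# Assumption: if the radicand is negative, fall back to the ratio formula.
moe_p <- if (radicand >= 0) {
  sqrt(radicand) / denom
} else {
  sqrt(moe_count^2 + p^2 * moe_den^2) / denom
}

c(count = count, moe_count = moe_count, pct = 100 * p, moe_pct = 100 * moe_p)
```

Note that the combined MOE is smaller than the simple sum of the two MOEs, which is why they should not just be added together.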

11 Questions?

12 Maps with tmap - Demo

12.1 tmap

The tmap package is great for making both static and interactive maps. It turns R into a GIS.

Let’s check it out with our last dataframe.

12.2 tmap

library(tmap)
## Warning: package 'tmap' was built under R version 3.4.4
tmap_mode("view") # set mode to interactive
## tmap mode set to interactive viewing
poverty_map <- tm_shape(sf_poor_summed) +
                  tm_polygons(col="pct_below_pov")

12.3 tmap

View the map - click on tracts

poverty_map

12.4 tmap

There are a number of great tutorials online for working with tmap.

See the References at the end of this workshop document.

13 Census Geographic Data Files

13.1 Census Geographic Data Files

Cartographic Boundary vs Detailed TIGER/Line data

By default, tidycensus downloads census cartographic boundary data.

  • These are simplified geometries, clipped to coastlines.

In get_acs you can also request the more detailed census TIGER/Line data.

The cartographic boundary data is great for mapping but the detailed data is often better for analysis.

Let’s check it out.

13.2 Fetch Cartographic Boundary Data

sf_poor_cb <- get_acs(geography = "tract",   
                   variables = c('C17002_002','C17002_003'), # poverty variables
                   summary_var = "C17002_001",
                   year = 2016,           
                   state="CA",
                   county="San Francisco",
                   geometry=TRUE,
                   cb = TRUE)     # THIS IS THE DEFAULT!
## Getting data from the 2012-2016 5-year ACS

13.3 Fetch Detailed TIGER/Line Geometry

sf_poor_tl <- get_acs(geography = "tract",   
                   variables = c('C17002_002','C17002_003'), # poverty variables       
                   summary_var = "C17002_001",
                   year = 2016,              
                   state="CA",
                   county="San Francisco",
                   geometry=TRUE,
                   cb = FALSE)  # Fetching the TIGER/Line data  
## Getting data from the 2012-2016 5-year ACS

13.4 Visualize differences with tmap

Zoom in to explore, especially around the coastline.

tm_shape(sf_poor_tl) + tm_borders() +
tm_shape(sf_poor_cb) + tm_borders(col="red")

14 Questions?

15 Summary

15.1 Summary

  • tidycensus offers two key functions for fetching census tabular and geographic data: get_acs and get_decennial

  • Using tidycensus to fetch the tabular data or both tabular and geographic data is IMHO way easier than any alternatives, IF you (1) know R and (2) know a bit about working with geographic data in R.

  • This approach is also scalable if you want multiple census variables and geographies.

  • If you just want to fetch the geographic data it may be easier to use the tigris package or download it directly from the census.

16 Extras for Enthusiasts

16.1 Scaling Up Example

In this example we show you how you can read in census variables of interest from a file into an R dataframe. You can then use that dataframe to fetch data for all those variables using tidycensus.

# Load cenvar lookup table of vars of interest
my_cenvar_df <- read.csv("data/cenvar_lookup.csv", strip.white = T, stringsAsFactors = F)

my_cenvar_df
##              my_cen_var_names my_cen_vars
## 1          citizenship_totpop B05001_001E
## 2     citizenship_non_citizen B05001_006E
## 3                entry_totpop B05005_001E
## 4                  entry_2010 B05005_002E
## 5             entry_2000_2009 B05005_007E
## 6           birthplace_totpop B05007_001E
## 7            birthplace_europ B05007_014E
## 8            birthplace_asian B05007_027E
## 9     birthplace_latinAmerica B05007_040E
## 10    birthplace_southAmerica B05007_081E
## 11    birthplace_other_nonUSA B05007_094E
## 12    birthplace_byage_totpop B06001_001E
## 13     birthplace_byage_fborn B06001_049E
## 14             poverty_totpop B06012_001E
## 15                  below_pov B06012_002E
## 16                 below_pov2 B06012_003E
## 17       poverty_fborn_totpop B06012_017E
## 18            below_pov_fborn B06012_018E
## 19           below_pov2_fborn B06012_019E
## 20       health_native_totpop B27020_002E
## 21  health_native_noinsurance B27020_006E
## 22    health_fborn_nat_totpop B27020_008E
## 23 fborn_nohealth_naturalized B27020_012E
## 24 health_fborn_noncit_totpop B27020_013E
## 25  fborn_nohealth_noncitizen B27020_017E

16.2 Fetch the ACS data

Fetch the ACS data for these variables for the nine-county Bay Area

nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")
bay9_data <-get_acs(geography = "tract", 
                       variables = my_cenvar_df$my_cen_vars, 
                       year=2016,
                       state = "CA", 
                       county = nine_counties, 
                       geometry = T)
## Getting data from the 2012-2016 5-year ACS
bay9_data
## Simple feature collection with 39700 features and 5 fields (with 150 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID                                          NAME   variable
## 1  06001400100 Census Tract 4001, Alameda County, California B05001_001
## 2  06001400100 Census Tract 4001, Alameda County, California B05001_006
## 3  06001400100 Census Tract 4001, Alameda County, California B05005_001
## 4  06001400100 Census Tract 4001, Alameda County, California B05005_002
## 5  06001400100 Census Tract 4001, Alameda County, California B05005_007
## 6  06001400100 Census Tract 4001, Alameda County, California B05007_001
## 7  06001400100 Census Tract 4001, Alameda County, California B05007_014
## 8  06001400100 Census Tract 4001, Alameda County, California B05007_027
## 9  06001400100 Census Tract 4001, Alameda County, California B05007_040
## 10 06001400100 Census Tract 4001, Alameda County, California B05007_081
##    estimate moe                       geometry
## 1      3018 195 MULTIPOLYGON (((-122.2469 3...
## 2       218 106 MULTIPOLYGON (((-122.2469 3...
## 3       944 168 MULTIPOLYGON (((-122.2469 3...
## 4        61  50 MULTIPOLYGON (((-122.2469 3...
## 5       176  99 MULTIPOLYGON (((-122.2469 3...
## 6       843 156 MULTIPOLYGON (((-122.2469 3...
## 7       208  75 MULTIPOLYGON (((-122.2469 3...
## 8       546 141 MULTIPOLYGON (((-122.2469 3...
## 9        46  43 MULTIPOLYGON (((-122.2469 3...
## 10       11  17 MULTIPOLYGON (((-122.2469 3...

16.3 Reformat Output

  1. We only want to keep the estimate column for each variable of interest, plus the GEOID and geometry columns.

  2. We then want to make the data wide using the spread function. This will put each variable's estimate in its own column.

bay9_data2 <- bay9_data %>%
  select("GEOID", "variable", "estimate") %>%
  spread(key=variable, value=estimate)
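The long-to-wide step is easier to see on a tiny table. The sketch below uses a plain data frame (no geometry) with estimates borrowed from the first two tracts, relabeled with hypothetical IDs t1 and t2. It uses pivot_wider, which tidyr 1.0.0 and later recommend in place of spread; that substitution is my choice, not the workshop's.

```r
library(tidyr)

# Toy long-format data: one row per tract and variable.
toy_long <- data.frame(
  GEOID    = c("t1", "t1", "t2", "t2"),
  variable = c("B05001_001", "B05001_006", "B05001_001", "B05001_006"),
  estimate = c(3018, 218, 1960, 92),
  stringsAsFactors = FALSE
)

# One row per tract, one column per variable.
toy_wide <- pivot_wider(toy_long, names_from = variable, values_from = estimate)
toy_wide
```

The equivalent spread call would be spread(toy_long, key = variable, value = estimate).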

16.4 Take a look

bay9_data2
## Simple feature collection with 1588 features and 26 fields (with 6 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID B05001_001 B05001_006 B05005_001 B05005_002 B05005_007
## 1  06001400100       3018        218        944         61        176
## 2  06001400200       1960         92        315         33         55
## 3  06001400300       5236        317        900        130        314
## 4  06001400400       4171        190        524         69        115
## 5  06001400500       3748        430        701        226        123
## 6  06001400600       1661         62        205         21          9
## 7  06001400700       4552        353        619        122        177
## 8  06001400800       3506        276        767        108        185
## 9  06001400900       2262        202        446         20         16
## 10 06001401000       6193        477        754         90        186
##    B05007_001 B05007_014 B05007_027 B05007_040 B05007_081 B05007_094
## 1         843        208        546         46         11         43
## 2         243         65        136         28         19         14
## 3         857        119         90        257         86        391
## 4         471        146        228         35         15         62
## 5         635        160         83        245         37        147
## 6         178         49         74         38          0         17
## 7         587         70        145        299         67         73
## 8         741        181        330        149          8         81
## 9         405         20        136        148         38        101
## 10        671         56         46        519         67         50
##    B06001_001 B06001_049 B06012_001 B06012_002 B06012_003 B06012_017
## 1        3018        843       3011        113         20        843
## 2        1960        243       1952        106         84        243
## 3        5236        857       5153        450        217        848
## 4        4171        471       4158        268        198        471
## 5        3748        635       3733        339        240        635
## 6        1661        178       1656        158        229        178
## 7        4552        587       4552        820        723        587
## 8        3506        741       3457        381        348        741
## 9        2262        405       2228        358        268        405
## 10       6193        671       6184       1466        768        671
##    B06012_018 B06012_019 B27020_002 B27020_006 B27020_008 B27020_012
## 1          31         12       2175         88        625         12
## 2          31         22       1717         38        151          0
## 3         124        183       4379        111        540         34
## 4          52         82       3694        152        281          0
## 5         151         58       3113        243        205          6
## 6          59         21       1483        143        116          8
## 7         107        103       3965        512        234         33
## 8          75        191       2759        184        465         99
## 9          66         39       1857        103        203         15
## 10        120         17       5513        463        194          0
##    B27020_013 B27020_017                       geometry
## 1         218         10 MULTIPOLYGON (((-122.2469 3...
## 2          92         35 MULTIPOLYGON (((-122.2574 3...
## 3         317         15 MULTIPOLYGON (((-122.2642 3...
## 4         190         21 MULTIPOLYGON (((-122.2618 3...
## 5         430        148 MULTIPOLYGON (((-122.2694 3...
## 6          62          0 MULTIPOLYGON (((-122.2681 3...
## 7         353        147 MULTIPOLYGON (((-122.2779 3...
## 8         276         79 MULTIPOLYGON (((-122.2887 3...
## 9         202         28 MULTIPOLYGON (((-122.2856 3...
## 10        477        111 MULTIPOLYGON (((-122.2787 3...

16.5 Rename the columns

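Use the dataframe of census variables to rename the columns so that they are self-describing. Note that this positional renaming only works because the rows of the lookup table are in the same order as the variable columns created by spread.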

colnames(bay9_data2)<-c("GEOID", my_cenvar_df$my_cen_var_names, "geometry")
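Renaming by position is compact but fragile if the lookup file is ever reordered. A more defensive variant, sketched below on a toy table, matches each column to its lookup row by variable code. The toy rows are deliberately out of order, and stripping the trailing E from the codes is an assumption based on the variable names shown in the get_acs output.

```r
# Toy lookup table, deliberately NOT in the same order as the data columns.
lookup <- data.frame(
  my_cen_var_names = c("below_pov", "poverty_totpop"),
  my_cen_vars      = c("B06012_002E", "B06012_001E"),
  stringsAsFactors = FALSE
)

# Toy wide data with one tract.
wide <- data.frame(GEOID = "06001400100",
                   B06012_001 = 3011,
                   B06012_002 = 113,
                   stringsAsFactors = FALSE)

# Drop the trailing "E" so the codes match the column names.
codes <- sub("E$", "", lookup$my_cen_vars)

# For each column, find its row in the lookup (NA if it is not a variable).
idx <- match(names(wide), codes)
names(wide)[!is.na(idx)] <- lookup$my_cen_var_names[idx[!is.na(idx)]]

names(wide)
```

Columns that do not appear in the lookup, such as GEOID and geometry, keep their original names.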

16.6 Take a look

bay9_data2
## Simple feature collection with 1588 features and 26 fields (with 6 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -123.5335 ymin: 36.89303 xmax: -121.2082 ymax: 38.86424
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##          GEOID citizenship_totpop citizenship_non_citizen entry_totpop
## 1  06001400100               3018                     218          944
## 2  06001400200               1960                      92          315
## 3  06001400300               5236                     317          900
## 4  06001400400               4171                     190          524
## 5  06001400500               3748                     430          701
## 6  06001400600               1661                      62          205
## 7  06001400700               4552                     353          619
## 8  06001400800               3506                     276          767
## 9  06001400900               2262                     202          446
## 10 06001401000               6193                     477          754
##    entry_2010 entry_2000_2009 birthplace_totpop birthplace_europ
## 1          61             176               843              208
## 2          33              55               243               65
## 3         130             314               857              119
## 4          69             115               471              146
## 5         226             123               635              160
## 6          21               9               178               49
## 7         122             177               587               70
## 8         108             185               741              181
## 9          20              16               405               20
## 10         90             186               671               56
##    birthplace_asian birthplace_latinAmerica birthplace_southAmerica
## 1               546                      46                      11
## 2               136                      28                      19
## 3                90                     257                      86
## 4               228                      35                      15
## 5                83                     245                      37
## 6                74                      38                       0
## 7               145                     299                      67
## 8               330                     149                       8
## 9               136                     148                      38
## 10               46                     519                      67
##    birthplace_other_nonUSA birthplace_byage_totpop birthplace_byage_fborn
## 1                       43                    3018                    843
## 2                       14                    1960                    243
## 3                      391                    5236                    857
## 4                       62                    4171                    471
## 5                      147                    3748                    635
## 6                       17                    1661                    178
## 7                       73                    4552                    587
## 8                       81                    3506                    741
## 9                      101                    2262                    405
## 10                      50                    6193                    671
##    poverty_totpop below_pov below_pov2 poverty_fborn_totpop
## 1            3011       113         20                  843
## 2            1952       106         84                  243
## 3            5153       450        217                  848
## 4            4158       268        198                  471
## 5            3733       339        240                  635
## 6            1656       158        229                  178
## 7            4552       820        723                  587
## 8            3457       381        348                  741
## 9            2228       358        268                  405
## 10           6184      1466        768                  671
##    below_pov_fborn below_pov2_fborn health_native_totpop
## 1               31               12                 2175
## 2               31               22                 1717
## 3              124              183                 4379
## 4               52               82                 3694
## 5              151               58                 3113
## 6               59               21                 1483
## 7              107              103                 3965
## 8               75              191                 2759
## 9               66               39                 1857
## 10             120               17                 5513
##    health_native_noinsurance health_fborn_nat_totpop
## 1                         88                     625
## 2                         38                     151
## 3                        111                     540
## 4                        152                     281
## 5                        243                     205
## 6                        143                     116
## 7                        512                     234
## 8                        184                     465
## 9                        103                     203
## 10                       463                     194
##    fborn_nohealth_naturalized health_fborn_noncit_totpop
## 1                          12                        218
## 2                           0                         92
## 3                          34                        317
## 4                           0                        190
## 5                           6                        430
## 6                           8                         62
## 7                          33                        353
## 8                          99                        276
## 9                          15                        202
## 10                          0                        477
##    fborn_nohealth_noncitizen                       geometry
## 1                         10 MULTIPOLYGON (((-122.2469 3...
## 2                         35 MULTIPOLYGON (((-122.2574 3...
## 3                         15 MULTIPOLYGON (((-122.2642 3...
## 4                         21 MULTIPOLYGON (((-122.2618 3...
## 5                        148 MULTIPOLYGON (((-122.2694 3...
## 6                          0 MULTIPOLYGON (((-122.2681 3...
## 7                        147 MULTIPOLYGON (((-122.2779 3...
## 8                         79 MULTIPOLYGON (((-122.2887 3...
## 9                         28 MULTIPOLYGON (((-122.2856 3...
## 10                       111 MULTIPOLYGON (((-122.2787 3...

16.7 Fetching data for multiple years

This requires the variable names to be the same across years!

# use purrr::map_df to get data for multiple years (must have same vars!)
pop90_10 <- map_df(c(1990, 2000, 2010), function(x) { 
  get_decennial(geography = "state",
  variables = c(totalpop = "P001001"),
  sumfile = "sf1",
  year = x) %>%
  mutate(year = x) }
)

# View output
head(pop90_10)
tail(pop90_10)

# Plot it
pop90_10 %>% ggplot(aes(x=reorder(NAME,value), y=value/1000000, fill=factor(year))) + 
             geom_bar(stat="identity", position=position_dodge()) + coord_flip()
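The fetch-tag-stack pattern can be tried without an API key. In the sketch below, fake_fetch is a hypothetical stand-in for get_decennial that returns made-up values; only the map_df structure mirrors the code above.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for get_decennial(): returns made-up values
# so the pattern can be run without a Census API key.
fake_fetch <- function(year) {
  tibble(NAME     = c("State A", "State B"),
         variable = "totalpop",
         value    = c(1, 2) * year)
}

years <- c(1990, 2000, 2010)

# Fetch each year, tag the rows with the year, and stack the results.
pop_by_year <- map_df(years, function(x) {
  fake_fetch(x) %>% mutate(year = x)
})

pop_by_year
```

Because every call returns the same columns, the results stack cleanly into one long data frame with a year column, ready for plotting.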

17 Combining Census Data with Other Data

17.1 Area Weighted Interpolation

One of the strengths of the sf package is how relatively easy it is to reaggregate data from one geometry to another. This process is called areal interpolation.

Area weighted interpolation reaggregates the data based on the percent of area shared by input and output geometries.

17.2 Read in a Shapefile

sfnhoods<- st_read("data/sfnhoods.shp")
head(sfnhoods)
plot(sfnhoods['nhood'])

17.3 Check the CRS

st_crs(sfnhoods)
st_crs(sf_poor5)

17.4 CRS transformation

sf_poor5_4326 = st_transform(sf_poor5, st_crs(sfnhoods))

17.5 Area Weighted Interpolation

Reaggregate percent of people below poverty from census tract to neighborhood polygons.

sfhoods2 = st_interpolate_aw(sf_poor5_4326[, "pct_below_pov"], sfnhoods,
extensive = F) # TRUE = area-weighted sum; FALSE = area-weighted average
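The difference between the two settings of extensive is easiest to see on a toy case. The sketch below uses two made-up square "tracts" and one "neighborhood" that covers half of the first and all of the second. The shapes, the values, and the use of EPSG:3857 are assumptions for illustration only.

```r
library(sf)

# Helper: axis-aligned rectangle from x0 to x1, 1000 m tall by default.
make_rect <- function(x0, x1, y0 = 0, y1 = 1000) {
  st_polygon(list(rbind(c(x0, y0), c(x1, y0), c(x1, y1), c(x0, y1), c(x0, y0))))
}

# Two source "tracts", 1000 m x 1000 m each, with values 10 and 30.
src <- st_sf(value = c(10, 30),
             geometry = st_sfc(make_rect(0, 1000), make_rect(1000, 2000),
                               crs = 3857),
             agr = "constant")

# One target "neighborhood": half of the first tract, all of the second.
target <- st_sfc(make_rect(500, 2000), crs = 3857)

aw_avg <- st_interpolate_aw(src["value"], target, extensive = FALSE)
aw_sum <- st_interpolate_aw(src["value"], target, extensive = TRUE)

aw_avg$value  # area-weighted mean: (10 * 0.5 + 30 * 1) / 1.5
aw_sum$value  # area-weighted sum:   10 * 0.5 + 30 * 1
```

Percentages and rates, such as pct_below_pov, are intensive values and call for extensive = FALSE; raw counts, such as a number of people, are extensive and call for extensive = TRUE.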

17.6 Map it

par(mfrow=c(1,2))
plot(sf_poor5['pct_below_pov'])
plot(sfhoods2['pct_below_pov'])
par(mfrow=c(1,1))

17.7 Map it with tmap

tm_shape(sfhoods2) +
   tm_polygons(col="pct_below_pov")

17.8 Combine the values

head(sfhoods2)
sfnhoods$pct_below_pov <- sfhoods2$pct_below_pov

# map again - click on polygons and view data in popups
# to confirm the AWI output values
tm_shape(sfnhoods) +
  tm_polygons(col="pct_below_pov", 
    popup.vars = c("nhood", "pct_below_pov")
  )